This report was prepared for ETC5521 Assignments 1 and 2 by Team bilby, comprising (Yuheng Cui), (Jimmy Effendy), Weihao Li, and Yan Ma.
Families and friends tend to spend their holidays and weekends in amusement parks. Amusement parks have been growing in popularity in recent years: worldwide attendance at the top 10 amusement park groups reached the half-billion mark for the first time last year (Schneider, 2019), with 4% year-on-year growth (TEA/AECOM, 2019).
With this ever-increasing popularity, it is reasonable to expect that the safety of amusement rides is a matter of substantial public interest (Woodcock, 2014). The annual number of ride-related injuries in North America was estimated at 1,289 in 2018, which was 26% higher than in 2017 (International Association of Amusement Parks and Attractions [IAAPA], 2018). IAAPA further stated that approximately 11% of those injuries were “serious”, meaning that they resulted in urgent admission and hospitalization for more than 24 hours for reasons other than medical observation, or in death.
While accidents in amusement parks are arguably not frequent, they have prominent effects when they happen. The International Association of Amusement Parks and Attractions (IAAPA, 2017) reported that, after a ride malfunction in Australia killed four people, the park and other venues in Australia suffered considerable declines in attendance. This shows that public confidence, safety, and commercial viability are strongly interconnected (Woodcock, 2014).
Thorough evaluation of injury records related to amusement rides is aligned with the public interest and is integral to encouraging continual improvement in the industry. This report aims to find the factors that have a major influence on amusement ride accidents. Uncovering these insights may encourage amusement park owners to direct their resources towards the groups or equipment most at risk. In addition, this report may assist regulatory bodies when safety standards and regulations are developed.
We believe amusement parks must make an effort to reduce incidents by maintaining equipment, training staff and monitoring visitors. At the same time, visitors should follow park regulations and equipment instructions; after all, visitors are responsible for their own safety in the first place. Together, these efforts can reduce the injuries that occur in amusement parks and improve the well-being of society.
First, the data used in the report and how it was prepared for analysis are described. Then, the analysis and the findings from the datasets are presented and discussed. Finally, we conclude with the major findings and a discussion of future studies.
The datasets were downloaded from the GitHub repository of Tidy Tuesday. Tidy Tuesday (2020) is a weekly social data project in R. We use the datasets from its activity of September 10, 2019.
There are two datasets provided in the repository: one originated from data.world and the other from the Saferparks database. An additional dataset from the Texas Department of Insurance (TDI) about current amusement ride insurance policies is also used in this report (TDI, 2020a).
The Texas amusement park injury dataset (tx_injuries.csv), which originated from data.world, was collected by the Texas Department of Insurance (TDI) (Millerbernd, 2018). It records injuries caused by amusement rides in the State of Texas from February 1, 2013 to February 1, 2017 (Millerbernd, 2018). An amusement ride is “any mechanical, gravity, or water device or devices that carry or convey passengers along, around, or over a fixed or restricted route or course or within a defined area for the purpose of giving its passengers amusement, pleasure, or excitement” (TDI, 2019). According to TDI (2019), amusement ride owners and operators must submit a quarterly injury report. TDI further stated that this record covers any injury that requires medical treatment or results in death.
This dataset has 542 observations and 13 variables. The name, type and description of each variable in tx_injuries.csv can be found in the data dictionary below.
| variable | class | description |
|---|---|---|
| injury_report_rec | double | Unique Record ID |
| name_of_operation | character | Company name |
| city | character | City |
| st | character | State (all TX) |
| injury_date | character | Injury date - note there are some different formats |
| ride_name | character | Ride Name |
| serial_no | character | Serial number of ride |
| gender | character | Gender of the injured individual |
| age | character | Age of the injured individual |
| body_part | character | Body part injured |
| alleged_injury | character | Alleged injury - type of injury |
| cause_of_injury | character | Approximate cause of the injury (free text) |
| other | character | Anecdotal information in addition to cause of injury |
The data quality check in Figure 2.1 shows that the variables other and serial_no have a high percentage of missing values. Because they have only a minor bearing on our analysis, they can be removed from the dataset. In addition, there are potential issues with the data types of age and injury_date, which can be resolved by converting them to integer and date respectively.
Figure 2.1: A data quality check on injury records in Texas amusement parks. This is an “at-a-glance” plot generated using the visdat package. The y axis represents the observations and the x axis represents the variables. Most of the variables are stored as “character” with few missing values. Serious data quality issues can be observed in serial_no and other, but these variables are not relevant to this analysis.
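The visual check above has a simple numeric counterpart. The following is a minimal sketch in base R, using an invented toy data frame in place of tx_injuries.csv; the 50% cut-off for dropping a column is an assumption made for illustration, not a threshold stated in this report.

```r
# Toy stand-in for tx_injuries.csv (values invented for illustration only)
injuries <- data.frame(
  age       = c("12", NA, "34", "7"),
  serial_no = c(NA, NA, NA, "A-101"),
  other     = c(NA, "see report", NA, NA),
  stringsAsFactors = FALSE
)

# Percentage of missing values in each column
pct_missing <- round(colMeans(is.na(injuries)) * 100, 1)

# Keep columns with at most 50% missing values (assumed threshold)
keep <- names(pct_missing)[pct_missing <= 50]
injuries_clean <- injuries[, keep, drop = FALSE]

print(pct_missing)
```

Sorting `pct_missing` in decreasing order gives a quick ranking of the columns that are candidates for removal.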
The accident records in the Saferparks dataset (safer_parks.csv), on the other hand, originated from the U.S. state and federal safety agencies that regulate amusement rides (Saferparks, 2020). Saferparks obtained them by submitting public records requests to those agencies. In some cases, additional requests were submitted to specific agencies for particular purposes (Saferparks, 2020). As a result, Saferparks had to harmonize these records into a single database.
It has 8351 observations and 23 variables. The name, type and description of each variable in safer_parks.csv can be found in the data dictionary below.
| variable | class | description |
|---|---|---|
| acc_id | double | Unique ID |
| acc_date | character | Accident Date |
| acc_state | character | Accident State |
| acc_city | character | Accident City |
| fix_port | character | . |
| source | character | Source of injury report |
| bus_type | character | Business type |
| industry_sector | character | Industry sector |
| device_category | character | Device category |
| device_type | character | Device type |
| tradename_or_generic | character | Common name of the device |
| manufacturer | character | Manufacturer of device |
| num_injured | double | Num injured |
| age_youngest | double | Youngest individual injured |
| gender | character | Gender of individual injured |
| acc_desc | character | Description of accident |
| injury_desc | character | Injury description |
| report | character | Report URL |
| category | character | Category of accident |
| mechanical | double | Mechanical failure (binary NA/1) |
| op_error | double | Operator error (binary NA/1) |
| employee | double | Employee error (binary NA/1) |
| notes | character | Additional notes |
The data quality check in Figure 2.2 shows that 4 variables have too many missing values to be analysed, so we ignore them. The variable acc_date should be converted to a date. Although manufacturer also has many missing values, we decided to keep it because it may be insightful.
Figure 2.2: A data quality check on the Saferparks accidents dataset. This is an “at-a-glance” plot. The y axis represents the observations and the x axis represents the variables. The dataset has serious missing value issues in the variables manufacturer, report, notes, mechanical, op_error and employee.
The insurance policies dataset (tx_policies.csv) originated from the Texas Department of Insurance. It lists the current amusement ride insurance policies in Texas. It has 683 observations and 5 variables. The name, type and description of each variable in tx_policies.csv can be found in the data dictionary below.
| variable | class | description |
|---|---|---|
| Record | integer | Unique ID |
| Name of Operation | character | Name of operation |
| Expiration Date | character | The expiration date of the insurance policy |
| Agent | character | The name of the sales agent |
| Carrier | character | The name of the carrier |
This dataset has very few missing values: around 99.9% of cases are complete.
As the documentation on accident reports provided by TDI is limited, it is difficult to determine the limitations of the dataset. In addition to a considerable number of missing values in some variables, TDI does not provide a data dictionary. As a result, a fair amount of guesswork was required for some of the variables. Lastly, the format of the injury_date variable is inconsistent.
According to Saferparks (n.d.), reporting criteria, level of detail, types of equipment included, and years covered vary widely across years, industry sectors, jurisdictions, and other factors. Saferparks further stated that states that are transparent, vigilantly monitor safety incidents, and implement efficient data management systems will log a higher number of accidents. In other words, a high number of injuries may be an indication of being more attentive to safety, not less (Saferparks, n.d.).
While the dataset can be used to uncover insights into how patrons get hurt on amusement rides, Saferparks does not recommend using it for comparisons across states, parks, rides or years (Saferparks, n.d.). One reason is that state laws on reporting amusement ride injuries vary widely. For instance, reporting go-kart accidents is mandatory in Florida but not in California.
Due to these limitations, the report will not use the Saferparks dataset to analyze nation-wide patterns. Hence, this report will largely focus on amusement park accidents that occur in Texas, United States of America.
A considerable amount of data wrangling was needed on the TDI injury dataset prior to the analysis. Firstly, the format of the date variable was inconsistent: some observations were stored in the serial number format that Excel uses (e.g. 39448). Dates in this format needed to be converted to “YYYY-MM-DD” format.
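As an illustration of this conversion, the sketch below parses a mixed date column in base R. The helper name `parse_injury_date` is hypothetical, and the assumption that the non-serial values use a month/day/year layout is ours; the dataset only documents that "there are some different formats". Excel serial numbers count days from 1899-12-30 (for dates after February 1900), which is the origin used here.

```r
# Hypothetical helper: convert a character column that mixes Excel serial
# numbers (e.g. "39448") with text dates into Date values.
parse_injury_date <- function(x) {
  x <- trimws(x)
  out <- rep(as.Date(NA), length(x))

  # Values made only of digits are treated as Excel serial numbers
  is_serial <- grepl("^[0-9]+$", x)
  out[is_serial] <- as.Date(as.numeric(x[is_serial]), origin = "1899-12-30")

  # Remaining values are assumed to be month/day/year text dates;
  # anything that does not parse (including NA) stays NA
  out[!is_serial] <- as.Date(x[!is_serial], format = "%m/%d/%Y")
  out
}

parsed <- parse_injury_date(c("39448", "6/15/2015", NA))
format(parsed, "%Y-%m-%d")
```

With this origin, the serial number 39448 quoted above corresponds to 2008-01-01.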
Secondly, the following variables were added to the TDI injury dataset:
- injury_year: the year when the observed injury occurred
- injury_month: the month when the observed injury occurred
- injury_day: the day when the observed injury occurred
- season: the season (U.S.) when the observed injury occurred

The TDI dataset on current insurance policies also needed to be cleaned. Firstly, the janitor (Firke, 2020) package was used to tidy the column names. Next, the agent variable needed to be wrangled, as many of its values were misspelled.
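The four derived variables added to the injury dataset (injury_year, injury_month, injury_day and season) can be sketched in base R as follows. The helper name `add_date_parts` is hypothetical, and the mapping of months to seasons assumes meteorological seasons (December to February as winter, and so on); the report itself only says the season is the U.S. one.

```r
# Hypothetical helper: add year, month, day and season columns
# derived from a Date column.
add_date_parts <- function(df, date_col = "injury_date") {
  d <- df[[date_col]]
  m <- as.integer(format(d, "%m"))

  df$injury_year  <- as.integer(format(d, "%Y"))
  df$injury_month <- m
  df$injury_day   <- as.integer(format(d, "%d"))

  # Assumed meteorological seasons, indexed by month number 1..12
  season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring",
                       "Summer", "Summer", "Summer", "Autumn", "Autumn",
                       "Autumn", "Winter")
  df$season <- season_by_month[m]
  df
}

# Toy dates, invented for illustration
example <- add_date_parts(
  data.frame(injury_date = as.Date(c("2015-06-15", "2016-12-01", "2014-03-31")))
)
print(example)
```

Rows with a missing date keep `NA` in all four derived columns, so they remain available for totals that do not depend on the date.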
Finally, the TDI injury dataset was combined with the TDI insurance policies dataset.
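A minimal sketch of these two steps (tidying column names, then combining the datasets) is given below in base R. The helper `clean_names_base` is a hypothetical stand-in for the janitor step, not janitor's implementation, and joining on the operation name is an assumption: the report does not state which key was used. All values are invented.

```r
# Hypothetical stand-in for column-name tidying: lower case, with
# runs of non-alphanumeric characters replaced by underscores.
clean_names_base <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_|_$", "", x)
}

# Toy policies table using the raw column names from the data dictionary
policies <- data.frame(
  Record = 1:2,
  `Name of Operation` = c("Park A", "Park B"),
  Agent = c("Smith Agency", "Jones Agency"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
names(policies) <- clean_names_base(names(policies))

# Toy injury records
injuries <- data.frame(
  injury_report_rec = c(101, 102, 103),
  name_of_operation = c("Park A", "Park A", "Park C"),
  stringsAsFactors = FALSE
)

# Left join: keep every injury record, add policy details where available
# (the join key is an assumption)
combined <- merge(injuries, policies, by = "name_of_operation", all.x = TRUE)
print(combined)
```

A left join keeps injury records from operations that have no current policy listed, which then show `NA` in the policy columns.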
Much of the data wrangling and transformation was done with the dplyr (Wickham, François, Henry, & Müller, 2020) and lubridate (Grolemund & Wickham, 2011) packages.
The primary question of this report concerns the patterns in the age and gender distribution of amusement park accidents. The answer can help parks identify high-risk groups and remind those groups to take care of themselves. The Saferparks dataset cannot be used here because it does not contain detailed individual records. Instead, we use the injury dataset reported by TDI.
Figure 3.1: A faceted bar plot of the age distribution in amusement park accident records. Missing values are removed. The x axis is the age and the y axis is the percentage.
| Age | Percent |
|---|---|
| (10,15] | 15.96 |
| (5,10] | 12.32 |
| (15,20] | 10.51 |
| (30,35] | 9.89 |
| (25,30] | 7.47 |
| (35,40] | 6.86 |
| (0,5] | 5.24 |
| (40,45] | 4.64 |
| (45,50] | 4.44 |
| (20,25] | 3.63 |

| Gender | Age | Percent |
|---|---|---|
| F | (10,15] | 8.68 |
| M | (10,15] | 7.28 |
| M | (5,10] | 6.87 |
| F | (15,20] | 5.86 |
| F | (30,35] | 5.85 |
| F | (5,10] | 5.45 |
| F | (35,40] | 4.65 |
| M | (15,20] | 4.65 |
| M | (30,35] | 4.04 |
| M | (25,30] | 3.83 |
Table 3.1 shows the ranking of age groups among the injured. Injured babies make up around 5% of the dataset. Excluding babies, 0.17% of the injured are under 16. We distinguish babies from other age groups because babies are carried by their parents and cannot move around on their own.
Taking gender into account, Table 3.2 shows the top 10 combinations of gender and age group. In this top 10 list, more girls than boys appear to be injured in amusement parks in Texas.
In summary, the age range of the injured is broad (from 0 to 71), but most of the injured are young people, especially children (under 18). The fact that around 5 percent of the injured are babies indicates that parents must take care of their babies in amusement parks and make them their first priority.
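The interval labels in Tables 3.1 and 3.2, such as (10,15], match the output of R's `cut()` with 5-year breaks. A minimal sketch of how such a table could be produced is shown below; the ages are invented, and the choice of breaks from 0 to 75 is an assumption.

```r
# Toy ages stored as character, as in the raw data (invented values)
age_raw <- c("12", "8", "14", "33", "3", "n/a", "17", "11")

# Convert to integer; unparseable entries become NA and are dropped
age <- suppressWarnings(as.integer(age_raw))
age <- age[!is.na(age)]

# 5-year bins, open on the left and closed on the right: (0,5], (5,10], ...
# Note: with these defaults an age of exactly 0 falls outside the first bin
# unless include.lowest = TRUE is set.
age_group <- cut(age, breaks = seq(0, 75, by = 5))

# Percentage of injured per age group, largest first
pct <- round(100 * prop.table(table(age_group)), 2)
head(sort(pct, decreasing = TRUE), 3)
```

Tabulating `age_group` together with gender, for example with a two-way `table()`, gives the breakdown by gender and age in the same way.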
The first secondary question is “what is the most dangerous equipment in parks?”. We count the total number of injuries by device type. Table 3.3 shows the top 10 ranking of high-risk equipment. It is not surprising that roller coasters cause the most injuries. Gutierrez (2016) reported that seven of eight high-profile U.S. amusement park deaths before 2016 were related to roller coasters.
| Device Type | Total Number of Injuries |
|---|---|
| Coaster - steel | 879 |
| Trampoline court | 678 |
| Go-kart | 648 |
| Tube slide | 519 |
| Aquatic play area | 337 |
| Coaster - wooden | 227 |
| Body slide | 204 |
| Flume ride | 183 |
| Water slide - undefined | 176 |
| Bowl slide | 175 |
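A ranking like the one in Table 3.3 comes from summing the number of injured per device type. The sketch below shows one way to do this in base R, using the device_type and num_injured columns from the Saferparks data dictionary; the rows are invented, and dropping records with a missing num_injured is an assumption about how missing counts are handled.

```r
# Toy accident records (invented values)
accidents <- data.frame(
  device_type = c("Coaster - steel", "Go-kart", "Coaster - steel",
                  "Go-kart", "Tube slide"),
  num_injured = c(2, 1, 3, NA, 1),
  stringsAsFactors = FALSE
)

# Total injured per device type; the formula interface drops rows
# where num_injured is NA
totals <- aggregate(num_injured ~ device_type, data = accidents, FUN = sum)

# Sort from most to fewest injuries and keep the top 10
totals <- totals[order(-totals$num_injured), ]
head(totals, 10)
```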
Figure 3.2: Word cloud of injury descriptions.
First, we create a list of common words, such as injury, injuries and pain. Second, we remove these unimportant words from the injury descriptions. Finally, the word cloud is created. In the word cloud (Figure 3.2), the bigger a word is, the more often it appears. “Head” appears to be the most frequent word, so we may assume that most people are hurt on the head.
In summary, in a high proportion of injury cases the injured are hurt in the upper half of the body. Amusement parks should pay extra attention to roller coasters because, unlike other equipment, a roller coaster failure can have severe consequences: no one can escape once a roller coaster is launched. First, parks should maintain roller coasters regularly to reduce mechanical failures. Second, they should train staff and ask them to go through a checklist every time the equipment is launched. Third, they should ask visitors to follow the safety guide. Furthermore, visitors should protect their head, neck and shoulders. These body parts are vulnerable and important, and they are frequently injured in amusement park injury cases.
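The three steps above (build a word list, remove those words, count what remains) can be sketched in base R as follows. The descriptions and the exact contents of the removal list are invented for illustration; only injury, injuries and pain are named in this report.

```r
# Toy injury descriptions (invented values)
injury_desc <- c("Head injury and neck pain", "Bruised head",
                 "Pain in left shoulder", "Head laceration")

# Words to remove; everything beyond injury/injuries/pain is an assumption
custom_stop <- c("injury", "injuries", "pain", "and", "in", "left")

# Split into lower-case words on any run of non-letters
words <- unlist(strsplit(tolower(injury_desc), "[^a-z]+"))

# Drop empty strings and the unimportant words
words <- words[nzchar(words) & !(words %in% custom_stop)]

# Word frequencies, most frequent first
word_freq <- sort(table(words), decreasing = TRUE)
print(word_freq)
```

A frequency table of this shape (one word, one count) is the input a word cloud is drawn from: the count determines the size of each word.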
This section examines whether the number of amusement ride injuries follows seasonal trends across the year. Table 3.4 shows that ride-related injuries have seasonal trends that are consistent across years. The number of injuries in autumn and winter is relatively low. The number starts to rise in spring and peaks in summer.
| Season | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|
| Autumn | 11 | 6 | 9 | 7 | 6 |
| Spring | 30 | 36 | 20 | 23 | 17 |
| Summer | 81 | 55 | 89 | 64 | 45 |
| Winter | 2 | 5 | 4 | 1 | 2 |
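To make the seasonal pattern easier to compare across years with different totals, the counts in Table 3.4 can be converted to within-year percentages. The sketch below re-enters the table by hand as a matrix; only base R is used.

```r
# Counts copied from Table 3.4 (rows: season, columns: year)
counts <- matrix(
  c(11, 30, 81, 2,    # 2013
     6, 36, 55, 5,    # 2014
     9, 20, 89, 4,    # 2015
     7, 23, 64, 1,    # 2016
     6, 17, 45, 2),   # 2017
  nrow = 4,
  dimnames = list(season = c("Autumn", "Spring", "Summer", "Winter"),
                  year   = c("2013", "2014", "2015", "2016", "2017"))
)

# Percentage of each year's injuries that fall in each season
share <- round(100 * prop.table(counts, margin = 2), 1)
print(share)

# Season with the largest share in each year
peak_season <- rownames(share)[apply(share, 2, which.max)]
```

In these figures summer is the peak season in every year and accounts for more than half of each year's recorded injuries.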
Figure 3.3 shows how the number of amusement ride injuries is distributed across months and years. The highest number of injuries in a single month occurred in June 2015, with 41 injuries. Another interesting feature of the graph is that the numbers of ride-related injuries in 2014 and 2017 are relatively low compared to other years.
It may be beneficial for ride owners to focus their resources on spring and summer, when the number of injuries is at its height. More regular and rigorous inspections of ride equipment can be performed during these periods. Ride owners may also provide staff with additional training in the period leading up to summer. It may also be advantageous to perform further study to uncover the true drivers of the following questions (which unfortunately are out of scope of this report):
Figure 3.3: Ride Related Injuries Seasonal Trends
Since the injury reporting mechanism varies from state to state, we only analysed injuries in Texas parks (assuming that the reporting mechanism is the same across parks in Texas). We want to analyse not only the injuries that happened each year in each park but also the total injuries across all years in each park, so we kept the injury records with missing injury_year values. Figure 3.5 shows the 10 amusement parks with the most injuries in Texas. Six Flags Over Texas has the most amusement park injuries in Texas. Sky group Investments LLC DBA iFly Houston Memorial had a notably large number of injuries in 2015, while Typhoon Texas - Austin had many injuries in 2016.
Figure 3.5: Amusement Park Injuries Ranking in Texas
In this part, we identify the manufacturers whose products caused the most amusement ride injuries over the years. This could remind amusement parks that own devices from these manufacturers to maintain them carefully. Figure 3.6 shows the top 10 manufacturers whose products caused the most injuries over the years, and their percentage of the total injuries recorded in the safer_parks.csv dataset.
There are 254 manufacturers in this dataset, with 8793 injuries in total, of which 2841 are attributed to the top 10 manufacturers. The top 10 manufacturers thus account for 32.3% of all amusement park injuries. It is noteworthy that in-house manufactured devices account for more than 10% of total injuries.
Figure 3.6: Top 10 Manufacturers with Most Injuries
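The top 10 share reported above is a ratio of two sums. The sketch below shows the calculation with a hypothetical helper, `top_share`, applied to invented per-manufacturer totals, and then re-derives the reported percentage from the two totals quoted in the text.

```r
# Hypothetical helper: percentage of all injuries attributed to the
# n manufacturers with the most injuries.
top_share <- function(injuries_by_maker, n = 10) {
  sorted <- sort(injuries_by_maker, decreasing = TRUE)
  100 * sum(head(sorted, n)) / sum(injuries_by_maker)
}

# Toy totals per manufacturer (invented values)
toy <- c(MakerA = 50, MakerB = 30, MakerC = 15, MakerD = 5)
toy_share <- top_share(toy, n = 2)

# Check of the figure in the text: 2841 of 8793 injuries
reported_share <- round(100 * 2841 / 8793, 1)
print(reported_share)
```

The second calculation reproduces the 32.3% stated above.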
The following packages are used to produce this report: visdat (Tierney, 2017), dplyr (Wickham et al., 2020), readr (Wickham, Hester, & Francois, 2018), tidyverse (Wickham et al., 2019), lubridate (Grolemund & Wickham, 2011), knitr (Xie, 2014), kableExtra (Zhu, 2019), tidytext (Silge & Robinson, 2016), wordcloud (Fellows, 2018), janitor (Firke, 2020), here (Müller, 2017), plotly (Sievert, 2020), rlist (Ren, 2016).
Fellows, I. (2018). Wordcloud: Word clouds. Retrieved from https://CRAN.R-project.org/package=wordcloud
Firke, S. (2020). Janitor: Simple tools for examining and cleaning dirty data. Retrieved from https://CRAN.R-project.org/package=janitor
Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. Retrieved from http://www.jstatsoft.org/v40/i03/
Gutierrez, L. (2016). Eight high-profile U.S. amusement park deaths in recent years. The Kansas City Star. Retrieved from https://www.kansascity.com/news/nation-world/national/article94407457.html
IAAPA. (2017). Global theme and amusement park outlook 2017–2021.
IAAPA. (2018). IAAPA ride safety report: North America, 2018.
Millerbernd, A. (2018). Texas amusement park accidents. Retrieved from https://data.world/amillerbernd/texas-amusement-park-accidents
Müller, K. (2017). Here: A simpler way to find your files. Retrieved from https://CRAN.R-project.org/package=here
Ren, K. (2016). Rlist: A toolbox for non-tabular data manipulation. Retrieved from https://CRAN.R-project.org/package=rlist
Saferparks. (2020). Accident reports from state/federal regulators. Retrieved from https://ridesdatabase.org/saferparks/data/
Saferparks. (n.d.). Saferparks accident data (pp. 2–3). Retrieved from https://ridesdatabase.org/wp-content/uploads/2020/02/Saferparks-data-description.pdf
Schneider, M. (2019). Theme park attendance crosses half-billion mark for 1st time. Retrieved from https://www.usnews.com/news/best-states/florida/articles/2019-05-23/theme-park-attendance-crosses-half-billion-mark-for-1st-time
Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC. Retrieved from https://plotly-r.com
Silge, J., & Robinson, D. (2016). Tidytext: Text mining and analysis using tidy data principles in R. JOSS, 1(3). https://doi.org/10.21105/joss.00037
TDI. (2019). Amusement ride FAQs. Retrieved from https://www.tdi.texas.gov/commercial/lcamuseinfo.html#reports
TDI. (2020a). Amusement ride current insurance policies. Retrieved from https://www.tdi.texas.gov/commercial/lcamusepolicy.html
TDI. (2020b). Amusement ride requirements. Retrieved from https://www.tdi.texas.gov/commercial/indexamusement.html
TEA/AECOM. (2019). TEA/AECOM 2019 Theme Index and Museum Index: The global attractions attendance report.
Tidy Tuesday. (2020). A weekly social data project in R. Retrieved from https://github.com/rfordatascience/tidytuesday
Tierney, N. (2017). Visdat: Visualising whole data frames. JOSS, 2(16), 355. https://doi.org/10.21105/joss.00355
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H., François, R., Henry, L., & Müller, K. (2020). Dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr
Wickham, H., Hester, J., & Francois, R. (2018). Readr: Read rectangular text data. Retrieved from https://CRAN.R-project.org/package=readr
Woodcock, K. (2014). Amusement ride injury data in the United States. Safety Science, 62, 466–474.
Xie, Y. (2014). Knitr: A comprehensive tool for reproducible research in R. In V. Stodden, F. Leisch, & R. D. Peng (Eds.), Implementing reproducible computational research. Chapman and Hall/CRC. Retrieved from http://www.crcpress.com/product/isbn/9781466561595
Zhu, H. (2019). KableExtra: Construct complex table with 'kable' and pipe syntax. Retrieved from https://CRAN.R-project.org/package=kableExtra